 response space


Deployment-complete benchmarking

arXiv.org Machine Learning

Benchmarks increasingly guide deployment, procurement and scientific screening, yet a score supports only the response it records, not necessarily the deployment action. We introduce deployment-complete benchmarking, which tests whether benchmark evidence determines a deployment action. A benchmark is complete for a claim exactly when the action is constant on each evidence fiber; mixed fibers expose missing deployment information, and completion curves quantify the evidence required to resolve ambiguity. In controlled response spaces, benchmark-channel conformal coverage of 94.98% transferred poorly to an unmeasured deployment channel (10.07%), whereas response-rank intervals achieved 94.91% coverage; even zero benchmark error certified only 45.4% of candidates at the largest residual size. Public audits revealed incompleteness, including 97.9% mixed Tox21 fibers and zero median certifiable fraction in main Matbench and JARVIS audits. In held-out replays, certify-then-acquire reduced false decisions from 1.19% to 0.027% in Tox21 and from 20.3% to 0.128% in JARVIS, while changing model choice and identifying deployment-relevant probes. Deployment-ready benchmarks should report evidence, supported actions, ambiguity and completion cost rather than scores alone.
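The completeness criterion above (the action is constant on each evidence fiber) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `fiber_report`, the record layout of `(evidence, action)` pairs, and the toy data are all assumptions made for the example.

```python
from collections import defaultdict


def fiber_report(records):
    """Group (evidence, action) pairs into fibers keyed by evidence.

    A fiber is "mixed" when candidates sharing the same benchmark evidence
    call for different deployment actions. The benchmark is complete for the
    claim only if no fiber is mixed. (Hypothetical helper, for illustration.)
    """
    fibers = defaultdict(set)
    for evidence, action in records:
        fibers[evidence].add(action)
    mixed = {e for e, actions in fibers.items() if len(actions) > 1}
    return {
        "complete": not mixed,
        "mixed_fraction": len(mixed) / len(fibers) if fibers else 0.0,
        "mixed_fibers": mixed,
    }


# Toy data: evidence is a tuple of recorded benchmark responses.
records = [
    (("pass", "high"), "deploy"),
    (("pass", "high"), "deploy"),
    (("pass", "low"), "deploy"),
    (("pass", "low"), "reject"),  # same evidence, different action: mixed fiber
    (("fail", "low"), "reject"),
]
report = fiber_report(records)
print(report["complete"], report["mixed_fibers"])
```

In this toy case one of the three fibers is mixed, so the recorded evidence does not determine the action; resolving it would require acquiring further evidence that splits the `("pass", "low")` fiber.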




Rethinking Cross-lingual Gaps from a Statistical Viewpoint

arXiv.org Artificial Intelligence

Any piece of knowledge is usually expressed in one or a handful of natural languages on the web or in any large corpus. Large Language Models (LLMs) act as a bridge by acquiring knowledge from a source language and making it accessible when queried from target languages. Prior research has pointed to a cross-lingual gap, viz., a drop in accuracy when the knowledge is queried in a target language compared to when the query is in the source language. Existing research has attributed the cross-lingual gap to divergence in latent representations between source and target languages. In this work, we take an alternative view and hypothesize that the variance of responses in the target language is the main cause of this gap. We present extensive experimental evidence that supports the proposed formulation and hypothesis. We then reinforce our hypothesis through multiple inference-time interventions that control the variance and reduce the cross-lingual gap. We demonstrate a simple prompt instruction to reduce the response variance, which improved target accuracy by 20-25% across different models. Large Language Models (LLMs) have revolutionized information access. Central to the mission of LLMs is to assimilate knowledge universally and make it available generally without any barriers. State-of-the-art LLMs are multilingual: Gemini supports over 40 languages (Gemini, 2025), GPT-5 supports at least 12 languages (GPT, 2025) (with no official number of supported languages) and open-source models like Gemma-3 support over 100 spoken languages (Gemma, 2025). Because pretraining data cannot contain duplicate information for every language, cross-lingual generalization is a necessary capability for LLMs. However, LLMs are known to show disparities in recalling knowledge across languages (Jiang et al., 2020; Kassner et al., 2021; Qi et al., 2023; Chua et al., 2024a; Goldman et al., 2025).
Our objective is to understand the causes of poor transfer of knowledge encoded in parameters across languages. We therefore evaluate models on knowledge-intensive tasks in a closed-book QA setting, i.e., without access to tools such as grounding in search. Cross-lingual gaps are quantified through disparity on parallel datasets that alter the language-specific surface form of the prompts.


A Expression for the marginal and conditional distributions

Neural Information Processing Systems

Hz for scan 2. The recorded visual [...] Hyperparameters included the learning rate and the regularization coefficient on the readout weights. A single ZIFFA model takes approximately 2-3 hours to train, whereas all other models take approximately 20-30 minutes to train. The hyperparameter search was completed using one GPU for a total of ~20 hours. We estimated the posterior mean of the neuron's responses to an image [...] We then inverse-transformed these samples to yield samples in the space of the neural responses. More specifically, for the FA-based models (except ZIFFA, see below), the posterior mean of the neuron's original response to image [...] To ensure that generated Gaussian samples (1) fall in a range where the transformation is invertible and that they (2) cover the most nonlinear part of the transformation, we kept the variances and covariances relatively small and sampled the mean for each neuron in a transform-specific fashion.



Response Wide Shut? Surprising Observations in Basic Vision Language Model Capabilities

arXiv.org Artificial Intelligence

Vision-language Models (VLMs) have emerged as general-purpose tools for addressing a variety of complex computer vision problems. Such models have been shown to be highly capable, but, at the same time, lacking some basic visual understanding skills. In this paper, we set out to understand the limitations of SoTA VLMs on fundamental visual tasks by constructing a series of tests that probe which components of design, specifically, may be lacking. Importantly, we go significantly beyond the current benchmarks, which simply measure the final performance of VLM response, by also comparing and contrasting it to the performance of probes trained directly on features obtained from the visual encoder, intermediate vision-language projection and LLM-decoder output. In doing so, we uncover shortcomings in VLMs and make a number of important observations about their capabilities, robustness and how they process visual information. We hope our insights will guide progress in further improving VLMs.


Accelerating RLHF Training with Reward Variance Increase

arXiv.org Artificial Intelligence

Reinforcement learning from human feedback (RLHF) is an essential technique for ensuring that large language models (LLMs) are aligned with human values and preferences during the post-training phase. As an effective RLHF approach, group relative policy optimization (GRPO) has demonstrated success in many LLM-based applications. However, efficient GRPO-based RLHF training remains a challenge. Recent studies reveal that a higher reward variance of the initial policy model leads to faster RLHF training. Inspired by this finding, we propose a practical reward adjustment model to accelerate RLHF training by provably increasing the reward variance while preserving the relative preferences and reward expectation. Our reward adjustment method inherently poses a nonconvex optimization problem, which is NP-hard to solve in general. To overcome the computational challenges, we design a novel $O(n \log n)$ algorithm to find a global solution of the nonconvex reward adjustment model by explicitly characterizing the extreme points of the feasible set. As an important application, we naturally integrate this reward adjustment model into the GRPO algorithm, leading to a more efficient GRPO with reward variance increase (GRPOVI) algorithm for RLHF training. As an interesting byproduct, we provide an indirect explanation for the empirical effectiveness of GRPO with rule-based reward for RLHF training, as demonstrated in DeepSeek-R1. Experimental results demonstrate that the GRPOVI algorithm can significantly improve the RLHF training efficiency compared to the original GRPO algorithm.
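To make the role of reward variance concrete, the sketch below computes the group-normalized advantage as it is commonly described for GRPO, together with the within-group reward variance. It is not the paper's reward adjustment algorithm; the function name `group_advantages`, the use of the population standard deviation, and the `eps` stabilizer are assumptions for illustration.

```python
from statistics import fmean, pvariance


def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: (r_i - mean) / (std + eps).

    Assumes population standard deviation over the sampled group; actual
    implementations may use the sample standard deviation instead.
    """
    mean = fmean(rewards)
    std = pvariance(rewards, mu=mean) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


# A group whose responses all receive the same reward has zero variance,
# so every advantage is zero and the group contributes no learning signal.
flat = group_advantages([1, 1, 1, 1])

# A group with spread-out rewards yields nonzero advantages.
spread = group_advantages([0, 0, 1, 1])
print(flat, spread)
```

Note that uniformly rescaling rewards around their mean would be cancelled by the normalization, so a variance-increasing adjustment has to reshape the rewards within the group rather than merely scale them.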


Generalised Boosted Forests

arXiv.org Machine Learning

This paper extends recent work on boosting random forests to model non-Gaussian responses. Given an exponential family $\mathbb{E}[Y|X] = g^{-1}(f(X))$, our goal is to obtain an estimate for $f$. We start with an MLE-type estimate in the link space and then define generalised residuals from it. We use these residuals and some corresponding weights to fit a base random forest and then repeat the same to obtain a boost random forest. We call the sum of these three estimators a \textit{generalised boosted forest}. We show with simulated and real data that both the random forest steps reduce test-set log-likelihood, which we treat as our primary metric. We also provide a variance estimator, which we can obtain with the same computational cost as the original estimate itself. Empirical experiments on real-world data and simulations demonstrate that the methods can effectively reduce bias, and that confidence interval coverage is conservative in the bulk of the covariate distribution.
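As a rough illustration of residuals and weights defined in the link space, the sketch below uses the familiar Newton (IRLS) working residual for a Bernoulli response with a logit link. This is an assumption chosen for the example; the paper's exact definition of generalised residuals may differ, and the helper names `sigmoid` and `working_residuals` are hypothetical.

```python
import math


def sigmoid(eta):
    """Inverse logit link: maps a link-space value to a mean in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def working_residuals(y, eta):
    """Return (residual, weight) pairs for a Bernoulli/logit model.

    residual = (y - mu) / (mu * (1 - mu)), weight = mu * (1 - mu),
    where mu = sigmoid(eta). A weighted regression of the residuals on the
    covariates then approximates a Newton update to the link-space estimate.
    """
    pairs = []
    for yi, ei in zip(y, eta):
        mu = sigmoid(ei)
        w = mu * (1.0 - mu)
        pairs.append(((yi - mu) / w, w))
    return pairs


# Starting from a constant link-space estimate eta = 0 (mu = 0.5):
pairs = working_residuals([1, 0], [0.0, 0.0])
print(pairs)
```

In a boosting scheme of the kind described above, a regressor would be fitted to these residuals with these weights, its prediction added to the current link-space estimate, and the residuals recomputed for the next stage.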


Deep Learning Reveals Underlying Physics of Light-matter Interactions in Nanophotonic Devices

arXiv.org Machine Learning

In this paper, we present a deep learning-based (DL-based) algorithm, as a purely mathematical platform, for providing intuitive understanding of the properties of electromagnetic (EM) wave-matter interaction in nanostructures. This approach is based on using the dimensionality reduction (DR) technique to significantly reduce the dimensionality of a generic EM wave-matter interaction problem without imposing significant error. Such an approach implicitly provides useful information about the role of different features (or design parameters such as geometry) of the nanostructure in its response functionality. To demonstrate the practical capabilities of this DL-based technique, we apply it to a reconfigurable optical metadevice enabling dual-band and triple-band optical absorption in the telecommunication window. Combining the proposed approach with existing commercial full-wave simulation tools offers a powerful toolkit to extract basic mechanisms of wave-matter interaction in complex EM devices and facilitate the design and optimization of nanostructures for a large range of applications including imaging, spectroscopy, and signal processing. It is worth mentioning that the demonstrated approach is general and can be used in a large range of problems as long as enough training data can be provided.